A Pilot Study
University of Kentucky
2024-10-25
Dr. Anton Vinogradov, PhD (recent!), Computer Science
Database of Etymological Roots Beginning in PIE
In DERBi PIE, having an integrated semantic system could provide automated answers to questions such as:
Are certain sound sequences associated with certain meanings or semantic spheres?
Are certain meanings associated with particular morphological classes or derivations?
How have meanings changed over time in the various branches and daughter languages?
Distributional Analysis: “You shall know a word by the company it keeps.” - John Rupert Firth
For example, you are very likely to see the words “dog” and “leash” appear near each other in a text.
However, you are less likely to see the words “dog” and “physics” near each other.
There exist programming tools that allow us to generate vectors (essentially coordinates) based on a word’s position within a text and proximity to other words.
All of the vectors that can be generated from a text would lie in a semantic space or hyperspace.
This approach is in line with what is done in present-day NLP – identifying semantic relationships through word embeddings.
Before we can run a text through one of these tools, we must first discuss tokenization and lemmatization.
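As a minimal sketch of these two preprocessing steps: tokenization splits raw text into word tokens, and lemmatization maps each token to its dictionary form. The tiny lemma table below is an illustrative assumption, not the actual lemmatizer used in the study.

```python
import re

def tokenize(text):
    """Split raw text into lowercase word tokens."""
    return re.findall(r"[a-záéíóúàèùâêîôûëïüç'-]+", text.lower())

# Toy lemma table; a real pipeline would use a full lemmatizer
# (this lookup is an illustrative assumption, not the study's tool).
LEMMAS = {"dogs": "dog", "walked": "walk", "walking": "walk", "leashes": "leash"}

def lemmatize(tokens):
    """Map each token to its dictionary form where known."""
    return [LEMMAS.get(t, t) for t in tokens]

tokens = tokenize("The dogs walked on their leashes.")
print(lemmatize(tokens))  # ['the', 'dog', 'walk', 'on', 'their', 'leash']
```

Collapsing inflected forms onto one lemma is what produces the vocabulary reduction mentioned later for the French and Spanish corpora.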
You can tinker with how exactly a word-embedding tool generates these vectors, for example:
How many dimensions will it generate for each vector?
How far to each side of a word will it look?
How many times will it run through the text?
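To make the “window” idea concrete, here is a count-based sketch of distributional vectors in plain Python. Trained embedding tools (e.g. word2vec) instead learn dense vectors of a chosen dimensionality over multiple passes through the text; this toy version just counts co-occurrences within a window, so its dimensionality equals the vocabulary size.

```python
def cooccurrence_vectors(tokens, window=2):
    """Count-based sketch of distributional vectors: each word's vector
    is its co-occurrence count with every vocabulary word, looking
    `window` tokens to each side (cf. word2vec's window parameter)."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0] * len(vocab) for w in vocab}
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                vectors[w][index[tokens[j]]] += 1
    return vocab, vectors

vocab, vecs = cooccurrence_vectors(
    ["the", "dog", "pulled", "the", "leash", "the", "dog", "barked"])
```

Words that keep similar company end up with similar count profiles, which is exactly the intuition behind the learned embeddings used in the study.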
The closer two words lie to each other in this space, the closer in semantic value they are.
Cosine similarity (the “fancy formula from Wikipedia”): cos(θ) = (A · B) / (‖A‖ ‖B‖)
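The formula in question is presumably cosine similarity, since that is the score used later for the bl- words. A minimal implementation:

```python
import math

def cosine_similarity(a, b):
    """cos(θ) = (a · b) / (‖a‖ ‖b‖): 1.0 for vectors pointing the same
    way (semantically close), 0.0 for orthogonal (unrelated) ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```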
Models trained on French & Spanish Wikipedia articles
Words removed (7.3% in French, 6.6% in Spanish)
Remaining words lemmatized, further reducing vocabulary by roughly 10%
1a. If there’s a 1:1 correspondence between the Romance language & Latin: language-word center
1b. If there isn’t a 1:1 correspondence, identify the centroid of the lemma’s vectors: language-word center
One of these things is not like the other!
We believe that our reconstruction model shows promise
We plan to stick with the Descendant Model strategy, but:
Utilize LLMs (such as GPT) to model the hyperspace, for greater precision and differentiation of polysemy
Instead of Google Translate, use a bilingual dictionary or an LLM
Add additional Romance languages;
When happy with results, move on to other subbranches (likely Slavic or Indic) for testing;
We believe that a good, workable model should be able to generate a *Proto-vector as the centroid of the language-word centers (minus those additional steps).
Let’s finish with where we began – DERBi PIE
With aligned hyperspaces generated for each descendant and reconstructed language, our models would give us coordinates for each lexeme in a language
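The slides don’t specify how the per-language hyperspaces are aligned; a standard technique for mapping one embedding space onto another is orthogonal Procrustes, sketched below. Treating it as the method used here is an assumption.

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal Procrustes: find the rotation R minimizing ||XR - Y||,
    so embeddings X (one language's space) are rotated into the
    coordinate system of Y. (An assumed alignment method, not one
    named in the slides.)"""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Sanity check: recover a known rotation of a random embedding matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
theta = 0.7
R_true = np.eye(4)
R_true[:2, :2] = [[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]]
R = procrustes_align(X, X @ R_true)
# X @ R now matches X @ R_true up to numerical precision
```

Once every descendant (and reconstructed) space sits in one shared coordinate system, comparing lexemes across languages reduces to comparing vectors.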
Possible question: “How semantically similar are PIE *sC- roots – as compared to related *C- roots?” (*(s)peḱ- ‘look at’)
Example from English:
bl- words: “blue”, “blaze”, “bland”, “blush”, “blink”, “blow”, “blast”, “blot”, “blend”, “bleak”
These words have a cosine similarity score of 0.8016 (remember: a perfect score is 1.0!)
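A single similarity score for a set of words can be computed as the mean cosine similarity over all unordered pairs; whether the 0.8016 figure was computed exactly this way is an assumption, and the vectors below are toy stand-ins for the real embeddings.

```python
import itertools
import math

def avg_pairwise_cosine(vectors):
    """Mean cosine similarity over all unordered pairs -- one way to
    score how semantically tight a word set (e.g. the bl- words) is.
    (How the slide's 0.8016 was computed is an assumption here.)"""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a))
                      * math.sqrt(sum(y * y for y in b)))
    pairs = list(itertools.combinations(vectors, 2))
    return sum(cos(a, b) for a, b in pairs) / len(pairs)

# toy 2-D vectors standing in for real word embeddings
score = avg_pairwise_cosine([[1.0, 0.1], [0.9, 0.2], [1.1, 0.0]])
```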